Computation and Language 37
☆ Gender-specific Machine Translation with Large Language Models
Decoder-only Large Language Models (LLMs) have demonstrated potential in
machine translation (MT), albeit with performance slightly lagging behind
traditional encoder-decoder Neural Machine Translation (NMT) systems. However,
LLMs offer a unique advantage: the ability to control the properties of the
output through prompts. In this study, we harness this flexibility to explore
LLaMa's capability to produce gender-specific translations for languages with
grammatical gender. Our results indicate that LLaMa can generate
gender-specific translations with competitive accuracy and gender bias
mitigation when compared to NLLB, a state-of-the-art multilingual NMT system.
Furthermore, our experiments reveal that LLaMa's translations are robust:
performance drops significantly when evaluated against opposite-gender
references on gender-ambiguous datasets, while consistency is maintained in
less ambiguous contexts. This research provides insights into the potential and
challenges of using LLMs for gender-specific translations and highlights the
importance of in-context learning to elicit new tasks in LLMs.
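The gender control described above is exercised purely through prompting. A minimal sketch of such a prompt builder — the template wording and function name are our own, not the paper's:

```python
def gendered_translation_prompt(source: str, tgt_lang: str, gender: str) -> str:
    # Illustrative template (not the paper's exact wording): the gender of
    # referents is controlled entirely through the instruction text.
    return (
        f"Translate the following English sentence into {tgt_lang}. "
        f"Assume the speaker is {gender}, and use grammatical gender "
        f"agreement consistent with a {gender} speaker.\n"
        f"English: {source}\n"
        f"{tgt_lang}:"
    )

prompt = gendered_translation_prompt("I am tired.", "Spanish", "female")
```

In an evaluation like the one described, the same source sentence would be prompted once per gender and each output scored against the matching gender-specific reference.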
☆ GPT-InvestAR: Enhancing Stock Investment Strategies through Annual Report Analysis with Large Language Models
Annual Reports of publicly listed companies contain vital information about
their financial health, which can help assess the potential impact on the
firm's stock price. These reports are comprehensive in nature, running to,
and sometimes exceeding, 100 pages. Analysing these reports is cumbersome
even for
a single firm, let alone the whole universe of firms that exist. Over the
years, financial experts have become proficient in extracting valuable
information from these documents relatively quickly. However, this requires
years of practice and experience. This paper aims to simplify the process of
assessing Annual Reports of all the firms by leveraging the capabilities of
Large Language Models (LLMs). The insights generated by the LLM are compiled in
a Quant styled dataset and augmented by historical stock price data. A Machine
Learning model is then trained with LLM outputs as features. The walk-forward
test results show promising outperformance with respect to S&P 500 returns.
This paper
intends to provide a framework for future work in this direction. To facilitate
this, the code has been released as open source.
☆ J-Guard: Journalism Guided Adversarially Robust Detection of AI-generated News AACL 2023
Tharindu Kumarage, Amrita Bhattacharjee, Djordje Padejski, Kristy Roschke, Dan Gillmor, Scott Ruston, Huan Liu, Joshua Garland
The rapid proliferation of AI-generated text online is profoundly reshaping
the information landscape. Among various types of AI-generated text,
AI-generated news presents a significant threat as it can be a prominent source
of misinformation online. While several recent efforts have focused on
detecting AI-generated text in general, these methods require enhanced
reliability, given concerns about their vulnerability to simple adversarial
attacks. Furthermore, due to the eccentricities of news writing, applying these
detection methods for AI-generated news can produce false positives,
potentially damaging the reputation of news organizations. To address these
challenges, we leverage the expertise of an interdisciplinary team to develop a
framework, J-Guard, capable of steering existing supervised AI text detectors
for detecting AI-generated news while boosting adversarial robustness. By
incorporating stylistic cues inspired by the unique journalistic attributes,
J-Guard effectively distinguishes between real-world journalism and
AI-generated news articles. Our experiments on news articles generated by a
vast array of AI models, including ChatGPT (GPT3.5), demonstrate the
effectiveness of J-Guard in enhancing detection capabilities while limiting
the average performance decrease to as low as 7% when faced with adversarial
attacks.
comment: This Paper is Accepted to The 13th International Joint Conference on
Natural Language Processing and the 3rd Conference of the Asia-Pacific
Chapter of the Association for Computational Linguistics (IJCNLP-AACL 2023)
☆ Everyone Deserves A Reward: Learning Customized Human Preferences
Reward models (RMs) are crucial in aligning large language models (LLMs) with
human preferences for improving interaction quality. However, the real world is
pluralistic, which leads to diversified human preferences based on different
religions, politics, cultures, etc. Moreover, each individual can have their
own unique preferences on various topics. Neglecting the diversity of human
preferences, current LLM training processes use only a general reward model,
which falls short for customized or personalized application scenarios. To
explore customized preference learning, we collect a domain-specific
preference (DSP) dataset, which contains preferred responses to each given
query from four practical domains. Besides, from the perspective of data
efficiency, we propose a three-stage customized RM learning scheme, whose
effectiveness is empirically verified on both general preference datasets and
our DSP set. Furthermore, we test multiple training and data strategies on the
three learning stages, and find several ways to better preserve the
general preference ability while training the customized RMs, especially
general preference enrichment and customized preference imitation learning. The
DSP dataset and code are available at https://github.com/Linear95/DSP.
☆ Knowledge Solver: Teaching LLMs to Search for Domain Knowledge from Knowledge Graphs
Large language models (LLMs), such as ChatGPT and GPT-4, are versatile and
can solve different tasks due to their emergent ability and generalizability.
However, LLMs sometimes lack domain-specific knowledge to perform tasks, which
would also cause hallucination during inference. In some previous works,
additional modules like graph neural networks (GNNs) are trained on retrieved
knowledge from external knowledge bases, aiming to mitigate the problem of
lacking domain-specific knowledge. However, incorporating additional modules
1) requires retraining them when encountering novel domains; and 2) becomes a
bottleneck, since LLMs' strong abilities are not fully utilized
for retrieval. In this paper, we propose a paradigm, termed Knowledge Solver
(KSL), to teach LLMs to search for essential knowledge from external knowledge
bases by harnessing their own strong generalizability. Specifically, we design
a simple yet effective prompt to transform retrieval into a multi-hop decision
sequence, which endows LLMs with the ability to search for knowledge in a
zero-shot manner. Additionally, KSL can provide complete retrieval paths and
therefore increase the explainability of LLMs' reasoning processes. We conduct
experiments on three datasets: CommonsenseQA, OpenbookQA, and MedQA-USMLE, and
find that our approach improves LLM baseline performance by a relatively large
margin.
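The multi-hop decision sequence described above can be sketched as a loop in which the LLM repeatedly picks the next relation to follow. Here a toy graph and a stub chooser stand in for the knowledge base and the LLM call; all entity and relation names are invented:

```python
# Toy knowledge graph: entity -> {relation: neighbor entity}
graph = {
    "fever": {"symptom_of": "infection", "treated_by": "antipyretic"},
    "infection": {"caused_by": "bacteria"},
}

def search(start, hops, choose):
    # At each hop, show the current entity's outgoing relations and let
    # `choose` (the LLM decision in the real system) pick one to follow.
    path, entity = [start], start
    for _ in range(hops):
        edges = graph.get(entity, {})
        if not edges:
            break
        rel = choose(entity, list(edges))
        entity = edges[rel]
        path += [rel, entity]
    return path

# Stub chooser: always take the first listed relation.
path = search("fever", 2, lambda entity, rels: rels[0])
```

The returned path doubles as the retrieval trace, which is what makes the reasoning process inspectable.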
☆ ContrastWSD: Enhancing Metaphor Detection with Word Sense Disambiguation Following the Metaphor Identification Procedure
This paper presents ContrastWSD, a RoBERTa-based metaphor detection model
that integrates the Metaphor Identification Procedure (MIP) and Word Sense
Disambiguation (WSD) to extract and contrast the contextual meaning with the
basic meaning of a word to determine whether it is used metaphorically in a
sentence. By utilizing the word senses derived from a WSD model, our model
enhances the metaphor detection process and outperforms other methods that rely
solely on contextual embeddings or integrate only the basic definitions and
other external knowledge. We evaluate our approach on various benchmark
datasets and compare it with strong baselines, demonstrating its
effectiveness in advancing metaphor detection.
comment: 10 pages, 2 figures
☆ A Multimodal Analysis of Influencer Content on Twitter AACL 2023
Influencer marketing involves a wide range of strategies in which brands
collaborate with popular content creators (i.e., influencers) to leverage their
reach, trust, and impact on their audience to promote and endorse products or
services. Because followers of influencers are more likely to buy a product
after receiving an authentic product endorsement rather than an explicit direct
product promotion, the line between personal opinions and commercial content
promotion is frequently blurred. This makes automatic detection of regulatory
compliance breaches related to influencer advertising (e.g., misleading
advertising or hidden sponsorships) particularly difficult. In this work, we
(1) introduce a new Twitter (now X) dataset consisting of 15,998 influencer
posts mapped into commercial and non-commercial categories for assisting in the
automatic detection of commercial influencer content; (2) experiment with an
extensive set of predictive models that combine text and visual information
showing that our proposed cross-attention approach outperforms state-of-the-art
multimodal models; and (3) conduct a thorough analysis of strengths and
limitations of our models. We show that multimodal modeling is useful for
identifying commercial posts, reducing the amount of false positives, and
capturing relevant context that aids in the discovery of undisclosed commercial
posts.
comment: Accepted at AACL 2023
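The cross-attention fusion mentioned in (2) can be sketched as text tokens attending over image-region features. This is a generic single-head version without learned projections, not the paper's exact architecture:

```python
import numpy as np

def cross_attention(text_feats, image_feats):
    # text_feats: (T, d) token features; image_feats: (R, d) region features
    scores = text_feats @ image_feats.T / np.sqrt(text_feats.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over regions
    return weights @ image_feats                    # (T, d) fused features

rng = np.random.default_rng(0)
fused = cross_attention(rng.normal(size=(3, 8)), rng.normal(size=(5, 8)))
```

A real model would add learned query/key/value projections, multiple heads, and classifier layers on top of the fused features.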
☆ Persona-aware Generative Model for Code-mixed Language
Code-mixing and script-mixing are prevalent across online social networks and
multilingual societies. However, a user's preference toward code-mixing depends
on the socioeconomic status, demographics of the user, and the local context,
which existing generative models mostly ignore while generating code-mixed
texts. In this work, we make a pioneering attempt to develop a persona-aware
generative model to generate texts resembling real-life code-mixed texts of
individuals. We propose a Persona-aware Generative Model for Code-mixed
Generation, PARADOX, a novel Transformer-based encoder-decoder model that
encodes an utterance conditioned on a user's persona and generates code-mixed
texts without monolingual reference data. We propose an alignment module that
re-calibrates the generated sequence to resemble real-life code-mixed texts.
PARADOX generates code-mixed texts that are semantically more meaningful and
linguistically more valid. To evaluate the personification capabilities of
PARADOX, we propose four new metrics -- CM BLEU, CM Rouge-1, CM Rouge-L and CM
KS. On average, PARADOX achieves 1.6 points better CM BLEU, 47% better
perplexity and 32% better semantic coherence than the non-persona-based
counterparts.
comment: 4 tables, 4 figures
☆ Leave no Place Behind: Improved Geolocation in Humanitarian Documents
Geographical location is a crucial element of humanitarian response,
outlining vulnerable populations, ongoing events, and available resources.
Latest developments in Natural Language Processing may help in extracting vital
information from the deluge of reports and documents produced by the
humanitarian sector. However, the performance and biases of existing
state-of-the-art information extraction tools are unknown. In this work, we
develop annotated resources to fine-tune the popular Named Entity Recognition
(NER) tools spaCy and RoBERTa to perform geotagging of humanitarian texts. We
then propose a geocoding method, FeatureRank, which links candidate locations
to the GeoNames database. We find that the humanitarian-domain data not only
improves the performance of the classifiers (up to F1 = 0.92) but also
alleviates some of the bias of the existing tools, which erroneously favor
locations in Western countries. Thus, we conclude that more resources from
non-Western documents are necessary to ensure that off-the-shelf NER systems
are suitable for deployment in the humanitarian sector.
☆ On the Challenges of Building Datasets for Hate Speech Detection
Detection of hate speech has been formulated as a standalone application of
NLP and different approaches have been adopted for identifying the target
groups, obtaining raw data, defining the labeling process, choosing the
detection algorithm, and evaluating the performance in the desired setting.
However, unlike other downstream tasks, hate speech suffers from the lack of
large-sized, carefully curated, generalizable datasets owing to the highly
subjective nature of the task. In this paper, we first analyze the issues
surrounding hate speech detection through a data-centric lens. We then outline
a holistic framework to encapsulate the data creation pipeline across seven
broad dimensions by taking the specific example of hate speech towards sexual
minorities. We posit that practitioners would benefit from following this
framework as a form of best practice when creating hate speech datasets in the
future.
comment: 12 pages, 1 figure
☆ ViCGCN: Graph Convolutional Network with Contextualized Language Models for Social Media Mining in Vietnamese
Social media processing is a fundamental task in natural language processing
with numerous applications. As Vietnamese social media and information science
have grown rapidly, the necessity of information-based mining on Vietnamese
social media has become crucial. However, state-of-the-art research faces
several significant drawbacks, including imbalanced and noisy data on social
media platforms. Data imbalance and noise are two essential issues that need
to be addressed in Vietnamese social media texts. Graph Convolutional Networks
can address these problems in text classification on social media by taking
advantage of the graph structure of the data. This study
presents a novel approach based on contextualized language model (PhoBERT) and
graph-based method (Graph Convolutional Networks). In particular, the proposed
approach, ViCGCN, jointly combines the power of contextualized embeddings with
the ability of Graph Convolutional Networks (GCN) to capture syntactic and
semantic dependencies, addressing those drawbacks. Extensive experiments on
various Vietnamese benchmark datasets were conducted to verify our approach.
Our observations show that applying GCN to BERTology models as the final layer
significantly improves performance. Moreover, the experiments demonstrate that
ViCGCN outperforms 13 powerful baseline models, including BERTology models,
fusion BERTology and GCN models, other baselines, and SOTA on three benchmark
social media datasets. Our proposed ViCGCN approach demonstrates a significant
improvement of up to 6.21%, 4.61%, and 2.63% over the best Contextualized
Language Models, including multilingual and monolingual, on three benchmark
datasets, UIT-VSMEC, UIT-ViCTSD, and UIT-VSFC, respectively. Additionally, our
integrated model ViCGCN achieves the best performance compared to other
BERTology integrated with GCN models.
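The GCN layer that ViCGCN stacks on top of contextual embeddings follows the standard normalized-propagation rule. A toy sketch, with a random graph and all-ones features standing in for the PhoBERT-based inputs:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    # Standard GCN propagation: add self-loops, symmetrically normalize the
    # adjacency, propagate features, apply a linear map and ReLU.
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    return np.maximum(d_inv_sqrt @ a_hat @ d_inv_sqrt @ feats @ weight, 0.0)

adj = np.array([[0.0, 1.0], [1.0, 0.0]])   # two connected nodes
out = gcn_layer(adj, np.ones((2, 4)), np.ones((4, 3)))
```

In ViCGCN's setting the node features would be contextualized sentence/word embeddings and the graph would encode document-word relations.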
☆ A deep Natural Language Inference predictor without language-specific training data
In this paper we present an NLP technique to tackle the problem of Natural
Language Inference (NLI) between pairs of sentences in a target language of
choice without a language-specific training dataset. We exploit a generic
translation dataset, manually translated, along with two instances of the same
pre-trained model - the first to generate sentence embeddings for the source
language, and the second fine-tuned over the target language to mimic the
first. This technique is known as Knowledge Distillation. The model has been
evaluated over machine translated Stanford NLI test dataset, machine translated
Multi-Genre NLI test dataset, and manually translated RTE3-ITA test dataset. We
also test the proposed architecture over different tasks to empirically
demonstrate the generality of the NLI task. The model has been evaluated over
the native Italian ABSITA dataset, on the tasks of Sentiment Analysis,
Aspect-Based Sentiment Analysis, and Topic Recognition. We emphasise the
generality and exploitability of the Knowledge Distillation technique, which
outperforms other methodologies based on machine translation even though it
was not directly trained on the data it was tested on.
comment: Conference: ICIAP2023
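The Knowledge Distillation objective described above reduces to making the student's embedding of a translated sentence match the teacher's embedding of the source sentence. A minimal sketch, with random vectors standing in for real sentence embeddings:

```python
import numpy as np

# Random stand-ins: 4 sentence pairs, 8-dimensional embeddings.
rng = np.random.default_rng(0)
teacher_src = rng.normal(size=(4, 8))                      # teacher, source sentences
student_tgt = teacher_src + 0.1 * rng.normal(size=(4, 8))  # student, translations

def distill_loss(teacher, student):
    # Mean squared error between paired sentence embeddings, the usual
    # objective for this kind of cross-lingual distillation.
    return float(np.mean((teacher - student) ** 2))

loss = distill_loss(teacher_src, student_tgt)
```

Training drives this loss toward zero, after which the student places target-language sentences in the teacher's embedding space — so the teacher's NLI head transfers without target-language NLI data.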
☆ Aligning Large Language Models for Clinical Tasks
Large Language Models (LLMs) have demonstrated remarkable adaptability,
showcasing their capacity to excel in tasks for which they were not explicitly
trained. However, despite their impressive natural language processing (NLP)
capabilities, effective alignment of LLMs remains a crucial challenge when
deploying them for specific clinical applications. The ability to generate
responses with factually accurate content and to engage in non-trivial
reasoning steps is crucial for LLMs to be eligible for applications in
clinical medicine. Employing a combination of techniques including
instruction-tuning and in-prompt strategies like few-shot and chain of thought
prompting has significantly enhanced the performance of LLMs. Our proposed
alignment strategy for medical question-answering, known as
'expand-guess-refine', offers a parameter and data-efficient solution. A
preliminary analysis of this method demonstrated outstanding performance,
achieving a score of 70.63% on a subset of questions sourced from the USMLE
dataset.
comment: 10 pages, 3 figures
☆ Promoting Open-domain Dialogue Generation through Learning Pattern Information between Contexts and Responses
Recently, utilizing deep neural networks to build open-domain dialogue
models has become a hot topic. However, the responses generated by these
models suffer from many problems, such as a lack of contextualization and a
tendency toward generic responses with little informative content, seriously
damaging the user experience. Therefore, many studies try to introduce more
information into dialogue models to make the generated responses more vivid
and informative. Unlike them, this paper improves the quality of generated
responses by learning the implicit pattern information between contexts and
responses in the training samples. In this paper, we first build an open-domain
dialogue model based on the pre-trained language model (i.e., GPT-2). And then,
an improved scheduled sampling method is proposed for pre-trained models, by
which the responses can be used to guide the response generation in the
training phase while avoiding the exposure bias problem. More importantly, we
design a response-aware mechanism for mining the implicit pattern information
between contexts and responses so that the generated replies are more diverse
and approximate to human replies. Finally, we evaluate the proposed model (RAD)
on the Persona-Chat and DailyDialog datasets; and the experimental results show
that our model outperforms the baselines on most automatic and manual metrics.
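The scheduled sampling underlying RAD mixes gold tokens and model predictions during training. A minimal sketch of the core mixing step — the `predict` stub and the fixed (decay-free) probability are simplifications of the paper's improved scheme:

```python
import random

def scheduled_inputs(gold, predict, p, rng):
    # At each step the decoder sees the gold previous token with
    # probability p, and its own prediction otherwise, which reduces
    # exposure bias. `predict` is a stub for the model's decoding step.
    inputs, prev = [], "<bos>"
    for gold_tok in gold:
        inputs.append(prev)
        prev = gold_tok if rng.random() < p else predict(prev)
    return inputs

rng = random.Random(0)
# with p=1.0 this reduces to ordinary teacher forcing
out = scheduled_inputs(["a", "b", "c"], lambda prev: prev.upper(), 1.0, rng)
```

In practice p is annealed downward over training so the model gradually learns to condition on its own outputs.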
☆ Agent-based simulation of pedestrians' earthquake evacuation; application to Beirut, Lebanon
Rouba Iskandar, Kamel Allaw, Julie Dugdale, Elise Beck, Jocelyne Adjizian-Gérard, Cécile Cornou, Jacques Harb, Pascal Lacroix, Nada Badaro-Saliba, Stéphane Cartier, Rita Zaarour
Most seismic risk assessment methods focus on estimating the damages to the
built environment and the consequent socioeconomic losses without fully taking
into account the social aspect of risk. Yet, human behaviour is a key element
in predicting the human impact of an earthquake, therefore, it is important to
include it in quantitative risk assessment studies. In this study, an
interdisciplinary approach simulating pedestrians' evacuation during
earthquakes at the city scale is developed using an agent-based model. The
model integrates the seismic hazard, the physical vulnerability as well as
individuals' behaviours and mobility. The simulator is applied to the case of
Beirut, Lebanon. Lebanon is at the heart of the Levant fault system that has
generated several Mw>7 earthquakes, the latest being in 1759. It is one of the
countries with the highest seismic risk in the Mediterranean region. This is
due to the high seismic vulnerability of the buildings due to the absence of
mandatory seismic regulation until 2012, the high level of urbanization, and
the lack of adequate spatial planning and risk prevention policies. Beirut as
the main residential, economic and institutional hub of Lebanon is densely
populated. To accommodate the growing need for urban development, constructions
have almost taken over all of the green areas of the city; squares and gardens
are disappearing to give place to skyscrapers. However, open spaces are safe
places to shelter, away from debris, and therefore play an essential role in
earthquake evacuation. Despite the massive urbanization, there are a few open
spaces but locked gates and other types of anthropogenic barriers often limit
their access. To simulate this complex context, pedestrians' evacuation
simulations are run in a highly realistic spatial environment implemented in
GAMA [1]. Previous data concerning soil and buildings in Beirut [2, 3] are
complemented by new geographic data extracted from high-resolution Pleiades
satellite images. The seismic loading is defined as a peak ground acceleration
of 0.3g, as stated in Lebanese seismic regulations. Building damages are
estimated using an artificial neural network trained to predict the mean damage
[4] based on the seismic loading as well as the soil and building vibrational
properties [5]. Moreover, the quantity and the footprint of the generated
debris around each building are also estimated and included in the model. We
simulate how topography, buildings, debris, and access to open spaces affect
individuals' mobility. Two city configurations are implemented: 1. Open spaces
are accessible without any barriers; 2. Access to some open spaces is blocked.
The first simulation results show that while 52% of the population is able to
reach an open space within 5 minutes after an earthquake, this number is
reduced to 39% when one of the open spaces is locked. These results show that
the presence of accessible open spaces in a city and their proximity to the
residential buildings is a crucial factor for ensuring people's safety when an
earthquake occurs.
☆ Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
As the size of large language models (LLMs) continues to grow, model
compression without sacrificing accuracy has become a crucial challenge for
deployment. While some quantization methods, such as GPTQ, have made progress
in achieving acceptable 4-bit weight-only quantization, attempts at lower bit
quantization often result in severe performance degradation. In this paper, we
introduce a technique called norm tweaking, which can be used as a plugin in
current PTQ methods to achieve high precision while being cost-efficient. Our
approach is inspired by the observation that rectifying the quantized
activation distribution to match its float counterpart can readily restore
accuracy for LLMs. To achieve this, we carefully design a tweaking strategy
that includes calibration data generation and channel-wise distance constraint
to update the weights of normalization layers for better generalization. We
conduct extensive experiments on various datasets using several open-sourced
LLMs. Our method demonstrates significant improvements in both weight-only
quantization and joint quantization of weights and activations, surpassing
existing PTQ methods. On GLM-130B and OPT-66B, our method even achieves the
same level of accuracy at 2-bit quantization as their float ones. Our simple
and effective approach makes it more practical for real-world applications.
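The norm-tweaking idea — adjusting a normalization layer's per-channel parameters so the quantized model's activation statistics match the float model's — can be sketched in closed form. The paper learns this from generated calibration data with a channel-wise distance constraint; here random activations stand in and the per-channel match is solved analytically:

```python
import numpy as np

rng = np.random.default_rng(1)
float_act = rng.normal(loc=2.0, scale=3.0, size=(256, 4))  # float activations
quant_act = rng.normal(loc=1.5, scale=2.0, size=(256, 4))  # quantized activations

def norm_tweak(quant, target):
    # Per-channel affine correction, playing the role of the normalization
    # layer's scale/bias update: match each channel's mean and std to the
    # float counterpart.
    scale = target.std(axis=0) / quant.std(axis=0)
    bias = target.mean(axis=0) - scale * quant.mean(axis=0)
    return quant * scale + bias

tweaked = norm_tweak(quant_act, float_act)
```

This captures why tweaking only normalization parameters can be enough: a cheap affine correction per channel restores the activation distribution downstream layers were trained to expect.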
☆ GRASS: Unified Generation Model for Speech Semantic Understanding
This paper explores the instruction fine-tuning technique for speech semantic
understanding by introducing a unified end-to-end (E2E) framework that
generates semantic labels conditioned on a task-related prompt for audio data.
We pre-train the model using large and diverse data, where instruction-speech
pairs are constructed via a text-to-speech (TTS) system. Extensive experiments
demonstrate that our proposed model significantly outperforms state-of-the-art
(SOTA) models after fine-tuning downstream tasks. Furthermore, the proposed
model achieves competitive performance in zero-shot and few-shot scenarios. To
facilitate future work on instruction fine-tuning for speech-to-semantic tasks,
we release our instruction dataset and code.
☆ Improving Code Generation by Dynamic Temperature Sampling
Recently, Large Language Models (LLMs) have shown impressive results in code
generation. However, existing decoding strategies are designed for Natural
Language (NL) generation, overlooking the differences between NL and
programming languages (PL). Due to this oversight, a better decoding strategy
for code generation remains an open question. In this paper, we conduct the
first systematic study to explore a decoding strategy specialized in code
generation. With an analysis of loss distributions of code tokens, we find that
code tokens can be divided into two categories: challenging tokens that are
difficult to predict and confident tokens that can be easily inferred. Among
them, the challenging tokens mainly appear at the beginning of a code block.
Inspired by the above findings, we propose a simple yet effective method:
Adaptive Temperature (AdapT) sampling, which dynamically adjusts the
temperature coefficient when decoding different tokens. We apply a larger
temperature when sampling for challenging tokens, allowing LLMs to explore
diverse choices. We employ a smaller temperature for confident tokens,
avoiding the influence of tail randomness noise. We apply AdapT sampling to
LLMs with
different sizes and conduct evaluations on two popular datasets. Results show
that AdapT sampling significantly outperforms state-of-the-art decoding
strategies.
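The AdapT idea — a higher temperature for challenging tokens and a lower one for confident tokens — can be sketched with a confidence heuristic. The max-softmax-probability test and the thresholds below are illustrative, not the paper's exact criterion:

```python
import math

def softmax(logits, temperature):
    exps = [math.exp(l / temperature) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def adaptive_temperature(logits, t_hi=1.2, t_lo=0.4, threshold=0.8):
    # Confident position (one clearly dominant token) -> low temperature;
    # challenging position (flat distribution, e.g. the start of a code
    # block) -> high temperature to allow exploration.
    confident = max(softmax(logits, 1.0)) >= threshold
    return t_lo if confident else t_hi

t_peaked = adaptive_temperature([5.0, 0.1, 0.2])   # peaked -> low temperature
t_flat = adaptive_temperature([1.0, 1.0, 1.0])     # flat -> high temperature
```

Decoding would then sample each token from `softmax(logits, adaptive_temperature(logits))`.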
☆ Rubric-Specific Approach to Automated Essay Scoring with Augmentation Training
Neural based approaches to automatic evaluation of subjective responses have
shown superior performance and efficiency compared to traditional rule-based
and feature engineering oriented solutions. However, it remains unclear whether
the suggested neural solutions are sufficient replacements of human raters as
we find recent works do not properly account for rubric items that are
essential for automated essay scoring during model training and validation. In
this paper, we propose a series of data augmentation operations that train and
test an automated scoring model to learn features and functions overlooked by
previous works while still achieving state-of-the-art performance in the
Automated Student Assessment Prize dataset.
comment: 13 pages
☆ HC3 Plus: A Semantic-Invariant Human ChatGPT Comparison Corpus
ChatGPT has gained significant interest due to its impressive performance,
but people are increasingly concerned about its potential risks, particularly
around the detection of AI-generated content (AIGC), which is often difficult
for untrained humans to identify. Current datasets utilized for detecting
ChatGPT-generated text primarily center around question-answering, yet they
tend to disregard tasks that possess semantic-invariant properties, such as
summarization, translation, and paraphrasing. Our primary studies demonstrate
that detecting model-generated text on semantic-invariant tasks is more
difficult. To fill this gap, we introduce a more extensive and comprehensive
dataset that considers more types of tasks than previous work, including
semantic-invariant tasks. In addition, a model fine-tuned on a large number
of task instructions shows strong performance. Owing to its previous success,
we further instruction-fine-tune Tk-instruct and build a more powerful
detection system. Experimental results show that our proposed detector
outperforms the previous state-of-the-art RoBERTa-based detector.
☆ Large Language Models for Automated Open-domain Scientific Hypotheses Discovery
Hypothetical induction is recognized as the main reasoning type when
scientists make observations about the world and try to propose hypotheses to
explain those observations. Past research on hypothetical induction has a
limited setting that (1) the observation annotations of the dataset are not raw
web corpus but are manually selected sentences (resulting in a close-domain
setting); and (2) the ground truth hypotheses annotations are mostly
commonsense knowledge, making the task less challenging. In this work, we
propose the first NLP dataset for social science academic hypotheses discovery,
consisting of 50 recent papers published in top social science journals. Raw
web corpora that are necessary for developing hypotheses in the published
papers are also collected in the dataset, with the final goal of creating a
system that automatically generates valid, novel, and helpful (to human
researchers) hypotheses, given only a pile of raw web corpora. The new dataset
can tackle the previous problems because it requires (1) using raw web
corpora as observations; and (2) proposing hypotheses that may even be new to
humanity. A
multi-module framework is developed for the task, as well as three different
feedback mechanisms that empirically show performance gain over the base
framework. Finally, our framework exhibits high performance in terms of both
GPT-4 based evaluation and social science expert evaluation.
☆ Offensive Hebrew Corpus and Detection using BERT CCS
Offensive language detection has been well studied in many languages, but it
is lagging behind in low-resource languages, such as Hebrew. In this paper, we
present a new offensive language corpus in Hebrew. A total of 15,881 tweets
were retrieved from Twitter. Each was labeled with one or more of five classes
(abusive, hate, violence, pornographic, or non-offensive) by Arabic-Hebrew
bilingual speakers. The annotation process was challenging as each annotator is
expected to be familiar with the Israeli culture, politics, and practices to
understand the context of each tweet. We fine-tuned two Hebrew BERT models,
HeBERT and AlephBERT, using our proposed dataset and another published dataset.
We observed that our data boosts HeBERT performance by 2% when combined with
D_OLaH. Fine-tuning AlephBERT on our data and testing on D_OLaH yields 69%
accuracy, while fine-tuning on D_OLaH and testing on our data yields 57%
accuracy, which may indicate the generalizability our data offers.
Our dataset and fine-tuned models are available on GitHub and Huggingface.
comment: 8 pages, 1 figure, The 20th ACS/IEEE International Conference on
Computer Systems and Applications (AICCSA)
☆ HAE-RAE Bench: Evaluation of Korean Knowledge in Language Models
Guijin Son, Hanwool Lee, Suwan Kim, Jaecheol Lee, Je Won Yeom, Jihyu Jung, Jung Woo Kim, Songseong Kim
Large Language Models (LLMs) pretrained on massive corpora exhibit remarkable
capabilities across a wide range of tasks, however, the attention given to
non-English languages has been limited in this field of research. To address
this gap and assess the proficiency of language models in the Korean language
and culture, we present HAE-RAE Bench, covering 6 tasks including vocabulary,
history, and general knowledge. Our evaluation of language models on this
benchmark highlights the potential advantages of employing Large
Language-Specific Models (LLSMs) over a comprehensive, universal model like
GPT-3.5. Remarkably, our study reveals that models approximately 13 times
smaller than GPT-3.5 can exhibit similar performance levels in terms of
language-specific knowledge retrieval. This observation underscores the
importance of homogeneous corpora for training professional-level
language-specific models. On the contrary, we also observe a perplexing
performance dip in these smaller LMs when they are tasked to generate
structured answers.
☆ Certifying LLM Safety against Adversarial Prompting
Large language models (LLMs) released for public use incorporate guardrails
to ensure their output is safe, often referred to as "model alignment." An
aligned language model should decline a user's request to produce harmful
content. However, such safety measures are vulnerable to adversarial prompts,
which contain maliciously designed token sequences to circumvent the model's
safety guards and cause it to produce harmful content. In this work, we
introduce erase-and-check, the first framework to defend against adversarial
prompts with verifiable safety guarantees. We erase tokens individually and
inspect the resulting subsequences using a safety filter. Our procedure labels
the input prompt as harmful if the prompt itself or any of its subsequences is
detected as harmful by the filter. This guarantees that any adversarial
modification of a harmful prompt up to a certain size is also labeled harmful.
We defend against three attack modes: i) adversarial suffix, which appends an
adversarial sequence at the end of the prompt; ii) adversarial insertion, where
the adversarial sequence is inserted anywhere in the middle of the prompt; and
iii) adversarial infusion, where adversarial tokens are inserted at arbitrary
positions in the prompt, not necessarily as a contiguous block. Empirical
results demonstrate that our technique obtains strong certified safety
guarantees on harmful prompts while maintaining good performance on safe
prompts. For example, against adversarial suffixes of length 20, it certifiably
detects 93% of the harmful prompts and labels 94% of the safe prompts as safe
using the open source language model Llama 2 as the safety filter.
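The erase-and-check procedure in suffix mode can be sketched as follows; `is_harmful` stands in for an arbitrary safety filter (e.g., a prompted Llama 2), and the function name, signature, and token representation are illustrative assumptions rather than the authors' implementation:

```python
def erase_and_check(prompt_tokens, is_harmful, max_erase=20):
    """Sketch of erase-and-check (suffix mode): flag the prompt as
    harmful if the filter flags the prompt itself or any subsequence
    obtained by erasing up to max_erase trailing tokens."""
    if is_harmful(prompt_tokens):
        return True
    for i in range(1, min(max_erase, len(prompt_tokens) - 1) + 1):
        if is_harmful(prompt_tokens[:-i]):
            return True
    return False
```

Because every suffix-erased subsequence up to the erase budget is inspected, a harmful prompt stays flagged no matter which adversarial suffix of that length is appended, which is where the certified guarantee comes from.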
☆ A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models
Key to tasks that require reasoning about natural language in visual contexts
is grounding words and phrases to image regions. However, observing this
grounding in contemporary models is complex, even if it is generally expected
to take place if the task is addressed in a way that is conducive to
generalization. We propose a framework to jointly study task performance and
phrase grounding, and propose three benchmarks to study the relation between
the two. Our results show that contemporary models demonstrate inconsistency
between their ability to ground phrases and solve tasks. We show how this can
be addressed through brute-force training on phrase grounding annotations, and
analyze the dynamics it creates. Code and data are available at
https://github.com/lil-lab/phrase_grounding.
☆ Zero-Resource Hallucination Prevention for Large Language Models
The prevalent use of large language models (LLMs) in various domains has
drawn attention to the issue of "hallucination," which refers to instances
where LLMs generate factually inaccurate or ungrounded information. Existing
techniques for hallucination detection in language assistants rely on
intricate, fuzzy, free-language-based chain-of-thought (CoT) techniques or on
parameter-based methods that suffer from interpretability issues. Additionally,
methods that identify hallucinations post-generation cannot prevent
their occurrence and suffer from inconsistent performance due to the influence
of the instruction format and model style. In this paper, we introduce a novel
pre-detection self-evaluation technique, referred to as {\method}, which
focuses on evaluating the model's familiarity with the concepts present in the
input instruction and withholding the generation of a response in the case of
unfamiliar concepts. This approach emulates the human ability to refrain from
responding to unfamiliar topics, thus reducing hallucinations. We validate
{\method} across four different large language models, demonstrating
consistently superior performance compared to existing techniques. Our findings
propose a significant shift towards preemptive strategies for hallucination
mitigation in LLM assistants, promising improvements in reliability,
applicability, and interpretability.
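The pre-detection idea can be sketched as below; the concept extractor, familiarity scorer, generator, and threshold are hypothetical interfaces, since the abstract does not specify them:

```python
def predetect_and_respond(instruction, extract_concepts, familiarity,
                          generate, threshold=0.5):
    """Sketch of pre-detection self-evaluation: score the model's
    familiarity with each concept in the instruction and withhold the
    response if any concept is unfamiliar."""
    concepts = extract_concepts(instruction)
    unfamiliar = [c for c in concepts if familiarity(c) < threshold]
    if unfamiliar:
        # Emulate a human declining to answer on unfamiliar topics.
        return "I am not familiar enough with: " + ", ".join(unfamiliar)
    return generate(instruction)
```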
☆ Epi-Curriculum: Episodic Curriculum Learning for Low-Resource Domain Adaptation in Neural Machine Translation
Neural Machine Translation (NMT) models have become successful, but their
performance remains poor when translating in new domains with limited data. In
this paper, we present Epi-Curriculum, a novel approach to low-resource domain
adaptation (DA), which combines a new episodic training framework with denoised
curriculum learning. Our episodic training
framework enhances the model's robustness to domain shift by episodically
exposing the encoder/decoder to an inexperienced decoder/encoder. The denoised
curriculum learning filters out noisy data and further improves the model's
adaptability by gradually guiding the learning process from easy to more
difficult tasks. Experiments on English-German and English-Romanian translation
show that: (i) Epi-Curriculum improves both the model's robustness and
adaptability
in seen and unseen domains; (ii) Our episodic training framework enhances the
encoder and decoder's robustness to domain shift.
♻ ☆ CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models
Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, Yifan Liu, Jingkuan Wang, Siyuan Qi, Kangning Zhang, Weinan Zhang, Yong Yu
With the emergence of Large Language Models (LLMs), there has been a
significant improvement in the programming capabilities of models, attracting
growing attention from researchers. We propose CodeApex, a bilingual benchmark
dataset focusing on the programming comprehension and code generation abilities
of LLMs. CodeApex comprises three types of multiple-choice questions:
conceptual understanding, commonsense reasoning, and multi-hop reasoning,
designed to evaluate LLMs on programming comprehension tasks. Additionally,
CodeApex utilizes algorithmic questions and corresponding test cases to assess
the code quality generated by LLMs. We evaluate 14 state-of-the-art LLMs,
including both general-purpose and specialized models. GPT exhibits the best
programming capabilities, achieving approximate accuracies of 50% and 56% on
the two tasks, respectively. There is still significant room for improvement in
programming tasks. We hope that CodeApex can serve as a reference for
evaluating the coding capabilities of LLMs, further promoting their development
and growth. Datasets are released at https://github.com/APEXLAB/CodeApex.git.
CodeApex submission website is https://apex.sjtu.edu.cn/codeapex/.
comment: 21 pages
♻ ☆ Single-Sentence Reader: A Novel Approach for Addressing Answer Position Bias
Machine Reading Comprehension (MRC) models tend to take advantage of spurious
correlations (also known as dataset bias or annotation artifacts in the
research community). Consequently, these models may perform the MRC task
without fully comprehending the given context and question, which is
undesirable since it may result in low robustness against distribution shift.
The main focus of this paper is answer-position bias, where a significant
percentage of training questions have answers located solely in the first
sentence of the context. We propose a Single-Sentence Reader as a new approach
for addressing answer position bias in MRC. Remarkably, in our experiments with
six different models, our proposed Single-Sentence Readers trained on the
biased dataset achieve results that nearly match those of models trained on the
normal dataset, proving their effectiveness in addressing answer position bias.
Our study also discusses several challenges our Single-Sentence Readers
encounter and proposes a potential solution.
comment: 10 pages, 5 tables, 2 figures
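The core idea, running an extractive reader on each sentence in isolation so that a span's position within the full context carries no signal, might be sketched as follows; the `reader` callable and its return format are assumptions for illustration:

```python
def single_sentence_read(context, question, reader):
    """Sketch of a Single-Sentence Reader: score each sentence
    independently with an extractive QA reader and keep the
    highest-scoring answer span."""
    sentences = [s.strip() + "." for s in context.split(".") if s.strip()]
    best = max((reader(sent, question) for sent in sentences),
               key=lambda r: r["score"])
    return best["answer"]
```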
♻ ☆ Learning Speech Representation From Contrastive Token-Acoustic Pretraining
For fine-grained generation and recognition tasks such as
minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic
speech recognition (ASR), the intermediate representations extracted from
speech should serve as a "bridge" between text and acoustic information,
containing information from both modalities. The semantic content is
emphasized, while the paralinguistic information such as speaker identity and
acoustic details should be de-emphasized. However, existing methods for
extracting fine-grained intermediate representations from speech suffer from
issues of excessive redundancy and dimension explosion. Contrastive learning is
a good method for modeling intermediate representations from two modalities.
However, existing contrastive learning methods in the audio field focus on
extracting global descriptive information for downstream audio classification
tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these
issues, we propose a method named "Contrastive Token-Acoustic Pretraining
(CTAP)", which uses two encoders to bring phoneme and speech into a joint
multimodal space, learning how to connect phoneme and speech at the frame
level. The CTAP model is trained on 210k speech and phoneme text pairs,
achieving minimally-supervised TTS, VC, and ASR. The proposed CTAP method
offers a promising solution for fine-grained generation and recognition
downstream tasks in speech processing.
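A frame-level contrastive objective in the spirit of CTAP can be illustrated with an InfoNCE-style loss; the embeddings, temperature, and one-to-one pairing of phoneme and speech frames here are illustrative assumptions, not the paper's exact formulation:

```python
import math

def frame_contrastive_loss(phoneme_frames, speech_frames, temperature=0.1):
    """InfoNCE-style frame-level contrastive loss: the i-th phoneme
    frame is treated as a positive pair with the i-th speech frame
    and contrasted against all other speech frames."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    loss = 0.0
    for i, p in enumerate(phoneme_frames):
        logits = [dot(p, s) / temperature for s in speech_frames]
        log_z = math.log(sum(math.exp(z) for z in logits))
        loss += log_z - logits[i]  # negative log-softmax at the match
    return loss / len(phoneme_frames)
```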
♻ ☆ Attention-Driven Multi-Modal Fusion: Enhancing Sign Language Recognition and Translation
In this paper, we devise a mechanism for the addition of multi-modal
information with an existing pipeline for continuous sign language recognition
and translation. In our procedure, we have incorporated optical flow
information with RGB images to enrich the features with movement-related
information. This work studies the feasibility of such modality inclusion using
a cross-modal encoder. The plugin we use is very lightweight and does not
require a separate end-to-end feature extractor for the new modality. We have
applied the changes to both sign language
recognition and translation, improving the result in each case. We have
evaluated the performance on the RWTH-PHOENIX-2014 dataset for sign language
recognition and the RWTH-PHOENIX-2014T dataset for translation. On the
recognition task, our approach reduced the WER by 0.9, and on the translation
task, our approach increased most of the BLEU scores by ~0.6 on the test set.
comment: This version has some errors. Our schedule is packed, so we don't
have enough time to correct it. We will share another work when we have time
to fix this
♻ ☆ An Empirical Analysis for Zero-Shot Multi-Label Classification on COVID-19 CT Scans and Uncurated Reports ICCV
Ethan Dack, Lorenzo Brigato, Matthew McMurray, Matthias Fontanellaz, Thomas Frauenfelder, Hanno Hoppe, Aristomenis Exadaktylos, Thomas Geiser, Manuela Funke-Chambour, Andreas Christe, Lukas Ebner, Stavroula Mougiakakou
The pandemic resulted in vast repositories of unstructured data, including
radiology reports, due to increased medical examinations. Previous research on
automated diagnosis of COVID-19 primarily focuses on X-ray images, despite
their lower precision compared to computed tomography (CT) scans. In this work,
we leverage unstructured data from a hospital and harness the fine-grained
details offered by CT scans to perform zero-shot multi-label classification
based on contrastive visual language learning. In collaboration with human
experts, we investigate the effectiveness of multiple zero-shot models that aid
radiologists in detecting pulmonary embolisms and identifying intricate lung
details like ground glass opacities and consolidations. Our empirical analysis
provides an overview of the possible solutions to target such fine-grained
tasks, so far overlooked in the medical multimodal pretraining literature. Our
investigation promises future advancements in the medical image analysis
community by addressing some challenges associated with unstructured data and
fine-grained multi-label classification.
comment: Proceedings of the IEEE/CVF International Conference on Computer
Vision (ICCV) Workshops 2023
♻ ☆ NNKGC: Improving Knowledge Graph Completion with Node Neighborhoods ISWC 2023
Knowledge graph completion (KGC) aims to discover missing relations of query
entities. Current text-based models utilize the entity name and description to
infer the tail entity given the head entity and a certain relation. Existing
approaches also consider the neighborhood of the head entity. However, these
methods tend to model the neighborhood using a flat structure and are
restricted to 1-hop neighbors. In this work, we propose a node
neighborhood-enhanced framework for knowledge graph completion. It models the
head entity neighborhood from multiple hops using graph neural networks to
enrich the head node information. Moreover, we introduce an additional edge
link prediction task to improve KGC. Evaluation on two public datasets shows
that this framework is simple yet effective. The case study also shows that
the model is able to produce explainable predictions.
comment: DL4KG Workshop, ISWC 2023
♻ ☆ A Survey on Measuring and Mitigating Reasoning Shortcuts in Machine Reading Comprehension
The issue of shortcut learning is widely known in NLP and has been an
important research focus in recent years. Unintended correlations in the data
enable models to easily solve tasks that were meant to exhibit advanced
language understanding and reasoning capabilities. In this survey paper, we
focus on the field of machine reading comprehension (MRC), an important task
for showcasing high-level language understanding that also suffers from a range
of shortcuts. We summarize the available techniques for measuring and
mitigating shortcuts and conclude with suggestions for further progress in
shortcut research. Importantly, we highlight two concerns for shortcut
mitigation in MRC: (1) the lack of public challenge sets, a necessary component
for effective and reusable evaluation, and (2) the lack of certain mitigation
techniques that are prominent in other areas.
comment: 18 pages, 2 figures, 4 tables
♻ ☆ Substitution-based Semantic Change Detection using Contextual Embeddings
Measuring semantic change has thus far remained a task where methods using
contextual embeddings have struggled to improve upon simpler techniques relying
only on static word vectors. Moreover, many of the previously proposed
approaches suffer from downsides related to scalability and ease of
interpretation. We present a simplified approach to measuring semantic change
using contextual embeddings, relying only on the most probable substitutes for
masked terms. Not only is this approach directly interpretable, it is also far
more efficient in terms of storage, achieves superior average performance
across the most frequently cited datasets for this task, and allows for more
nuanced investigation of change than is possible with static word vectors.
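The substitute-based comparison can be sketched as follows: collect the most probable masked-LM substitutes for a target word in each time period, then compare the two substitute distributions. Jensen-Shannon divergence is one natural choice, used here as an assumption rather than the paper's exact metric:

```python
from collections import Counter
import math

def substitute_divergence(subs_t1, subs_t2):
    """Jensen-Shannon divergence (base 2) between the distributions of
    masked-LM substitutes for a word in two periods; 0 means identical
    usage, values near 1 suggest strong semantic change."""
    c1, c2 = Counter(subs_t1), Counter(subs_t2)
    vocab = sorted(set(c1) | set(c2))
    p = [c1[w] / len(subs_t1) for w in vocab]
    q = [c2[w] / len(subs_t2) for w in vocab]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Storing only substitute lists per occurrence, rather than full contextual vectors, is what makes this kind of approach cheap and directly interpretable.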
♻ ☆ Layout and Task Aware Instruction Prompt for Zero-shot Document Image Question Answering
The pre-training-fine-tuning paradigm based on layout-aware multimodal
pre-trained models has achieved significant progress on document image question
answering. However, domain pre-training and task fine-tuning for additional
visual, layout, and task modules prevent them from directly utilizing
off-the-shelf instruction-tuning language foundation models, which have
recently shown promising potential in zero-shot learning. Instead of aligning
language models to the domain of document image question answering, we align
document image question answering to off-the-shelf instruction-tuning language
foundation models to utilize their zero-shot capability. Specifically, we
propose a layout- and task-aware instruction prompt called LATIN-Prompt, which
consists of layout-aware document content and task-aware descriptions. The
former recovers the layout information among text segments from OCR tools by
inserting appropriate spaces and line breaks. The latter ensures that the model
generates answers that meet the requirements, especially format requirements,
through a detailed description of the task. Experimental results on three
benchmarks show
that LATIN-Prompt can improve the zero-shot performance of instruction-tuning
language foundation models on document image question answering and help them
achieve comparable levels to SOTAs based on the pre-training-fine-tuning
paradigm. Quantitative analysis and qualitative analysis demonstrate the
effectiveness of LATIN-Prompt. We provide the code in the supplementary
material and will release it to facilitate future research.
comment: Add the LATIN-Tuning for Alpaca. Code is available at
https://github.com/WenjinW/LATIN-Prompt
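The layout-recovery step might be sketched as below; the segment coordinates, character-cell size, and reading-order assumption (segments pre-sorted left-to-right within a line) are illustrative, not the released implementation:

```python
def recover_layout(ocr_segments, char_width=10, line_height=20):
    """Render OCR segments (text, x, y) as plain text, using line
    breaks for vertical gaps and runs of spaces for horizontal gaps,
    so a text-only LLM can perceive the page layout."""
    lines = {}
    for text, x, y in ocr_segments:
        row, col = y // line_height, x // char_width
        line = lines.get(row, "")
        pad = max(col - len(line), 1 if line else 0)
        lines[row] = line + " " * pad + text
    return "\n".join(lines[r] for r in sorted(lines))
```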
♻ ☆ Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts
Large Language Models (LLMs) have achieved significant success across various
natural language processing (NLP) tasks, encompassing question-answering,
summarization, and machine translation, among others. While LLMs excel in
general tasks, their efficacy in domain-specific applications remains
underexplored. Additionally, LLM-generated text sometimes exhibits issues like
hallucination and disinformation. In this study, we assess LLMs' capability of
producing concise survey articles within the computer science-NLP domain,
focusing on 20 chosen topics. Automated evaluations indicate that GPT-4
outperforms GPT-3.5 when benchmarked against the ground truth. Furthermore,
four human evaluators provide insights from six perspectives across four model
configurations. Through case studies, we demonstrate that while GPT often
yields commendable results, there are instances of shortcomings, such as
incomplete information and lapses in factual accuracy.